Heuristics for Broad-Coverage Natural Language Parsing
1. INTRODUCTION

As everyone who has tried it knows, the hardest part of building a broad-coverage parser is not simply covering all the constructions of the language, but dealing with ambiguity. One approach to ambiguity resolution is to "understand" the text well enough to have a good semantic interpretation system, using real-world modeling, inference, etc. This can work well in small domains, and it is, in this author's opinion, ultimately necessary for the highest quality of natural language processing in any domain; but it is probably not feasible on a broad scale today. So some kind of heuristic method is needed for disambiguation, some way of ranking analyses and choosing the best. Even in the ideal model of human language processing (which would use a great deal of knowledge representation and inference), ranking heuristics seem appropriate as a mechanism, since humans must work with incomplete knowledge most of the time.

Two major questions can be asked about a heuristic method for ambiguity resolution:

1. What level of representation is used for disambiguation and is involved in the statements of the heuristic rules: lexical/morphological, surface syntactic, deep syntactic, or logical/semantic?

2. Where do the heuristic rules come from? Are they largely created through human linguistic insight, or are they induced by processing corpora?

This paper describes the heuristic method used in the Slot Grammar (SG) system [10, 11, 13, 16, 17] for ambiguity resolution: the SG parse scoring system. This scoring system operates during parsing (with a bottom-up chart parser), assigning real-number scores to partial analyses as well as to analyses of the complete sentence. The scores are used not only for ranking the final analyses but also for pruning the parse space during parsing, thus increasing time and space efficiency. The level of representation being disambiguated is thus the level of SG parses.
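The interaction of scoring with pruning can be sketched as follows. This is a minimal illustration of score-based pruning in a chart cell, not the actual SG implementation; the beam width, the span keys, and the toy analyses are all assumptions made for the example.

```python
import heapq
from collections import defaultdict

BEAM_WIDTH = 3  # hypothetical cap on partial analyses kept per chart span


def prune_cell(analyses, beam_width=BEAM_WIDTH):
    """Keep only the top-scoring partial analyses for one chart span.

    Each analysis is a (score, tree) pair; higher scores are better.
    Discarding low-scoring partial analyses early both ranks the final
    parses and shrinks the space of later combinations, which is the
    efficiency benefit described in the text.
    """
    return heapq.nlargest(beam_width, analyses, key=lambda a: a[0])


# Toy chart keyed by (start, end) word spans, with invented scores.
chart = defaultdict(list)
chart[(0, 2)] = [(-1.0, "NP(man, det:the)"),
                 (-3.5, "VP(saw, obj:man)"),
                 (-0.2, "NP(man)"),
                 (-5.0, "X(fragment)")]
chart[(0, 2)] = prune_cell(chart[(0, 2)])
print([tree for _, tree in chart[(0, 2)]])
```

With a beam of 3, the lowest-scoring analysis is dropped before any larger span can build on it.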
SG parse structures are dependency- or head-oriented, and include, in a single tree, both surface structure and deep syntactic information such as predicate-argument structure, remote dependencies, control information, and unwinding of passives.(1) SG parse structures also include a choice of word senses. The extent to which these represent semantic sense distinctions depends on the lexicon. The SG system is set up to deal with semantic word-sense distinctions and to resolve them by doing semantic type-checking during parsing. However, in the lexicon for ESG (English Slot Grammar), nearly all word-sense distinctions are a matter of part of speech or syntactic slot frame. Some semantic types are shown in the lexicon and are used in parsing, but generally very few. Thus one would say that ESG parse structures are basically syntactic structures, although deep information like argument structure, passive unwinding, etc., counts as "semantics" in some people's books.

Where do the SG scoring rules come from: human linguistic insight or induction from corpus processing? The score of an SG parse, which will be described in Section 4, is the sum of several components. Most of these come completely from human linguistic insight, though some of them get their numeric values from corpus processing. In the tests reported in the final section, only the "linguistic-insight" rules are used. Some previous tests using the corpus-based heuristic rules together with the main SG heuristic rules showed that the former could improve the parse rate by a few percentage points. It is definitely worth pursuing both approaches, and more work will be done with a combination of the two.

(1) No attempt is made to resolve quantifier scoping in SG parses, although there is a post-processing system that produces a logical form with scope resolution for quantifiers and other "focalizers" [12]. Anaphora resolution [8, 9] is also done by post-processing SG parses.
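The sum-of-components form of the score can be sketched briefly. The component rules and weights below are purely illustrative assumptions, not ESG's actual rule set; the point is only the shape of the computation: independent heuristic functions whose values are added.

```python
def parse_score(tree, components):
    """Sum independent heuristic score components for one parse.

    `components` is a list of functions mapping a parse tree to a float.
    In the SG scheme most such rules encode hand-written linguistic
    preferences, while some have numeric values estimated from corpora.
    """
    return sum(rule(tree) for rule in components)


# Hypothetical component rules over a toy dict-based "tree".
penalize_distant_attachment = lambda t: -0.5 * t.get("attachment_distance", 0)
reward_frame_match = lambda t: 1.0 if t.get("slot_frame_filled") else 0.0

tree = {"attachment_distance": 2, "slot_frame_filled": True}
print(parse_score(tree, [penalize_distant_attachment, reward_frame_match]))
# -0.5 * 2 + 1.0 = 0.0
```

Keeping the components additive and independent is what lets the same machinery mix insight-based and corpus-derived rules, as the text notes.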
Published: 1993